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Abstract 

Genomics is increasingly being used to investigate disease outbreaks, but an important question remains unanswered — 
how well do genomic data capture known transmission events, particularly for pathogens with long carriage periods or 
large within-host population sizes? Here we present a novel Bayesian approach to reconstruct densely sampled outbreaks 
from genomic data while considering within-host diversity. We infer a time-labeled phytogeny using Bayesian evolution- 
ary analysis by sampling trees (BEAST), and then infer a transmission network via a Monte Carlo Markov chain. We find 
that under a realistic model of within-host evolution, reconstructions of simulated outbreaks contain substantial un- 
certainty even when genomic data reflect a high substitution rate. Reconstruction of a real-world tuberculosis outbreak 
displayed similar uncertainty, although the correct source case and several clusters of epidemiological^ linked cases were 
identified. We conclude that genomics cannot wholly replace traditional epidemiology but that Bayesian reconstructions 
derived from sequence data may form a useful starting point for a genomic epidemiology investigation. 



Introduction 

Infectious disease outbreak investigation typically involves 
both field work — interviews that establish links between pa- 
tients and/or exposures to a source of infection — and molec- 
ular epidemiology, in which laboratory typing of pathogen 
isolates is used to identify related cases. Data from both 
streams are considered together to reconstruct the outbreak, 
identifying its origins and pathways of onward transmission. 
Ideally, the reconstruction aids in the real-time management 
of the outbreak and guides public health policy and practice 
in preventing future occurrences. 

Molecular epidemiology has recently undergone a revolu- 
tion with the advent of next-generation genome sequencing. 
With the cost of sequencing a complete bacterial genome 
now comparable to that of a gold-standard typing analysis 
(Sabat et al. 2013), traditional molecular methods for investi- 
gating bacterial disease outbreaks are increasingly being 
replaced by genomics (Didelot, Bowden, et al. 2012; Koser 
et al. 2012; Didelot 2013; Le and Diep 2013). Techniques 
such as multilocus sequence typing interrogate approxi- 
mately 0.1% of a bacterial genome. In contrast, whole- 
genome sequencing permits identification of sequence vari- 
ation across the complete genome. Given the short timescales 
over which outbreaks typically occur, only a small number of 
single nucleotide changes are expected between outbreak 
isolates. This diversity is not captured by traditional typing 
methods but can be identified and leveraged for outbreak 
reconstruction using whole-genome approaches. 

A first approach to exploit this emergence of genomic data 
is to define a threshold for the number of genomic differences 
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above which direct transmission is unlikely (Eyre et al. 2013; 
Walker et al. 2013; Walker et al. 2014). This has the advantage 
of being a simple way to rule out when transmission did not 
happen, but reconstructing transmission when it did happen 
is less straightforward. Most reconstructions published to 
date rely heavily upon fieldwork data, and the utility of 
genome sequencing for inferring transmission events in the 
absence of these data — which, for a given outbreak, may be 
incomplete or unavailable — remains unclear, particularly for 
diseases characterized by periods of latency or chronic infec- 
tion, during which substantial within-host genetic diversity 
may arise. Although most investigations include a phyloge- 
netic analysis of the outbreak isolates, phylogenetic trees do 
not directly correspond to transmission trees and cannot, as 
such, identify specific person-to-person transmission events 
(Pybus and Rambaut 2009). In a densely sampled outbreak, all 
cases will appear as tips of the phylogenetic tree, when in fact 
some of these cases will have transmitted to others, so inter- 
nal branching events are also associated with sampled hosts. 

Nevertheless, it should be possible to infer transmission 
from sequence data using alternative approaches, and several 
such methods have been proposed (Cottam, Thebaud, et al. 
2008; Jombart et al. 2011; Lieberman et al. 2011; Morelli et al. 
2012; Ypma et al. 2012; Jombart et al. 2014). These methods 
typically identify transmission events as branching events in a 
phylogeny, and they do not consider within-host genetic di- 
versity. This simplification may be appropriate for pathogens 
with a fairly small generation time — the time that elapses 
between a host becoming infected and causing other 
infections — but not for those with long generation times, 
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latency, and carriage. In these organisms, one can expect sev- 
eral nucleotide changes to accrue within a single host, with 
different individual lineages being transmitted onward to sec- 
ondary cases. Examples include Clostridium difficile (Didelot, 
Eyre, et al. 2012; Eyre et al. 2013), Mycobacterium tuberculosis 
(Gardy et al. 2011; Walker et al. 2013), Staphylococcus aureus 
(Young et al. 2012; Golubchik et al. 2013; Harris et al. 2013), 
and Helicobacter pylori (Kennemann et al. 201 1; Didelot et al. 
2013). Excluding the possibility that multiple distinct genetic 
lineages of a pathogen — distinguished by only a few muta- 
tions — can be present within a host can lead to serious mis- 
interpretation of putative transmission events (Ypma et al. 
2013). 

Furthermore, current methods for transmission inference 
do not leverage the capabilities of time-calibrated phyloge- 
netic inference methods. Given the importance of factors 
such as branch length in inferring the underlying host contact 
network structure from a phylogeny (Robinson et al. 201 3), an 
inference method that incorporates sampling times and a 
molecular clock analysis is preferable to one using neighbor- 
joining, maximum parsimony, or other simplified tree- 
building algorithms. 

With the increasing use of genomic epidemiology, a 
method capable of inferring transmission events from geno- 
mic data alone while also considering within-host diversity 
offers an important tool for outbreak reconstruction. Here, 
we present a Bayesian inference scheme based on timed phy- 
logenetic trees that can be run independently of epidemio- 
logical data or weighted with whatever fieldwork data is 
available, such as timing and duration of infectivity, level of 
infectivity, and geographic location. Our method rests on a 
within-host pathogen population genetic model, under 
which different lineages may be transmitted to secondary 
cases. We first reconstruct a timed phylogenetic tree and 
then infer the underlying transmission network, given the 
observed phylogeny. This second step is achieved by coloring 
the branches of the phylogeny with a unique color for each 
host so that a change of color represents transmission 
between hosts (fig. 1). This two-step approach represents a 
formalization of a previous approach to reconstruct transmis- 
sion events on top of a timed phylogeny (Cottam, 
Wadsworth, et al. 2008). We make use of existing robust 
methods for phylogenetic inference, such as Bayesian evolu- 
tionary analysis by sampling trees (BEAST) (Drummond and 
Rambaut 2007) and ClonalFrame (Didelot and Falush 2007), 
and employ a novel Bayesian methodology to infer a trans- 
mission network via a Monte Carlo Markov chain (MCMC). 
We apply our method to simulated data to illustrate its per- 
formance, and to a real-world data set to reconstruct the 
transmission of M. tuberculosis in a densely sampled outbreak. 

Results 

Within-Host Diversity Affects Placement of 
Transmission Events on a Phylogenetic Tree 
To assess the impact of within-host diversity on the inference 
of transmission events from a phylogeny, we simulated 
a transmission tree — the set of specific person-to-person 




Fig. 1. The colored genealogical tree (left). Each host isolate corresponds 
to a unique color. A lineage is colored according to the host it was in at 
the corresponding time. When a lineage changes from color q to q 
(forward in time), this represents / infecting/ Each color may not persist 
in the tree after the time of the corresponding tip, because this is the 
recovery time of the host. The subtree restricted to a single color (right) 
is the part of the tree inside the corresponding host; lineages are taken 
from this tree at the recovery time and the times when the host infected 
others. 



transmission events within an outbreak — and genealogies 
arising from this tree under two scenarios of effective popu- 
lation size. We modeled an outbreak in a susceptible popu- 
lation of size N- 100, with a per year recovery rate y = 2 
and per year, per contact infectivity rate /3 = 0.02. The 
expected number of infected individuals in a susceptible- 
infectious-removed (SIR) model is R(oo), such that 
R(oo) = -jiog^^ 1 (Allen 2008), which here is 
R(oo) = 13.51. A transmission tree T was simulated under 
these conditions, with ten individuals infected (fig. 2A). 

If the effective population size within hosts is N e = 0, then 
within-host diversity is zero, and transmission events coincide 
exactly with coalescent events of the phylogeny (fig. 2B). This 
assumption simplifies the relationship between the transmis- 
sion tree T and the genealogy G and may be appropriate for 
infections with a short incubation time. However, in cases of 
latent disease, chronic infection, or long carriage periods, this 
assumption may not be valid. An example of this is asymp- 
tomatic carriage of S. aureus, which can persist within a host 
for months to years (Scanvic et al. 2001; Eveillard et al. 2004; 
Marschall and Muhlemann 2006) so that the recovery rate 
y = 2 above is realistic, but the parameter N^g, corresponding 
to the product of effective population size N e and generation 
time g (i.e., duration of a replication cycle), is on the order of 
100 days (Young et al. 2012; Golubchik et al. 2013). 

Simulating a genealogy G under the same transmission tree 
T but with N e g = 100/365 = 0.274 year illustrates the com- 
plex relationship between G and T that arises under a model 
of within-host diversity (fig. 2C), with transmission events no 
longer corresponding to coalescent events. In particular, it is 
now possible for the genomes from two hosts A and B to 
share a common ancestor more recently than with the 
genome from a third host C, even if C infected both A and 
B. An example of this in figure 2C is provided by host 1, who 
infected both hosts 2 and 3. In spite of this, 2 and 3's genomes 
are more closely related to each other than to 1's, because 
both 2 and 3 were infected by a lineage from 1 that is different 
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Fic. 2. (A) Transmission tree simulated with N= 100, y = 2, and /i = 0.02. The numbers in square brackets represent the time of infection and 
removal of each individual, respectively. (B) Genealogical tree simulated conditional on the transmission tree in (A) and with parameter N e g = 0. 
Transmission events are indicated by red stars and a change in branch color. (C) Same as in (B) but with parameter N e g = 0.274. 



from the one that was sampled. If no within-host diversity 
was assumed, the genealogy in figure 2C precludes the possi- 
bility of 1 infecting both 2 and 3. Under our more realistic 
model with N e g = 0.274, these transmissions become a 
possibility. 

MCMC Inference of a Transmission Network from a 
Phylogeny Captures Known Transmission Parameters 
and Events 

To determine whether the known events and parameters of 
transmission tree T could be inferred from a genealogy C, we 
applied our MCMC method to the simulated data set in 
figure 2C. The MCMC was run for 100,000 iterations with 
the first half (burn-in) discarded. This took approximately 
5min on a desktop computer, and several independent 
runs were compared to ensure good convergence and 
mixing of the chain. 



The inferred rate of infectivity /3 had posterior mean 0.021 
(95% CI 0.010-0.038) capturing the correct value of ft = 0.02, 
while the inferred rate of removal y had posterior mean 1.82 
(95% CI 0.88-3.17), also capturing the correct value of y = 2. 
The inferred effective population size Neg had posterior mean 
0.49 (95% CI 0.02-2.49), which included the correct value of 
N e g = 0.274, but that was also compatible with Neg values 
up to an order of magnitude higher or lower than the correct 
value. This result reflects the difficulty of precisely inferring 
Neg, especially as only ten individuals were infected. In a sep- 
arate simulation with ji equal to 0.05, such that R(oo) = 89 
hosts became infected, all parameters were more precisely 
reconstructed, with CIs for /3, y, and Neg of 1.9-2.4, 
0.04-0.06, and 0.19-2.11, respectively. 

The posterior sample of transmission trees inferred based 
on the phylogeny in figure 2C was summarized using a graph 
representing all transmission events with posterior probability 
above 10% (fig. 3A). All ten correct transmission events are 
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Fig. 3. (A) Network representation of the posterior distribution of transmission trees for the simulated data set shown in figure 2C. Nodes represent 
hosts, and the numbers within square brackets to mean inferred infection time and the known removal time. Edges represent a posterior probability of 
transmission of at least 10%, with a darker edge indicating a higher probability. (B) Point estimate of the transmission tree obtained by taking the 
optimal branching tree in the network shown in part (A). 



present, but incorrect events are also contained in the graph, 
particularly the reverse of correct edges, making it difficult to 
distinguish infector from infected. With 30% of the posterior 
probability, host 1 was correctly identified the most likely 
source case (fig. 2A), but hosts 2 and 3 were also likely sources, 
with 25% and 17%, respectively. Overall, 33% of the posterior 
probability weight was carried by correct edges, with the re- 
maining two-thirds of probability supporting transmission 
events that did not actually occur in the simulation. We 
used Edmonds' algorithm (Gibbons 1985) to find the optimal 
branching tree within the graph carrying the maximum pos- 
terior weight to identify the most likely transmission scenario. 
In the resulting point estimate of the transmission tree, 
seven of the ten simulated transmission events were recov- 
ered (fig. 3B). 

One hundred scenarios analogous to the one shown in 
figure 2 were generated, each using the same parameters /3, 
y, and and each including ten infected individuals. 
Inference was performed for each of the hundred simulated 
data sets. On average, we found that correct transmission 
events carried 27% of the posterior probability weight. In 
each simulation, there are only ten correct transmission 
events, including transmission from an external source to 
the first case, out of a total of a hundred possible events 
(each of the ten hosts can infect any of the other nine, plus 
ten choices for transmission from the external source to the 
first case). Correct edges therefore carry, on average, almost 
four times more weight than incorrect edges: (0.27/10)/ 
(0.73/100) = 3.7. 



Reconstruction of a M. tuberculosis Outbreak Data Set 
To assess the performance of our inference method on a real- 
world data set, we applied it to an outbreak of tuberculosis for 
which a hypothesized source case had been identified and for 
which epidemiological data supporting several further trans- 
mission events was available. A BEAST phylogeny (fig. 4A, 
supplementary fig. SI, Supplementary Material online) indi- 
cated that most of the inferred transmission events occurred 
after the source case (K02) recovered, meaning the source 
could have infected eight other people at most — eight being 
the number of unique lineages present in the tree at the 
source's recovery time — and that the source harbored signif- 
icant within-host diversity. The clock rate estimated by BEAST 
had a posterior mean of 1.15 x 10~ 7 (95%CI 0.39 x 10~ 7 
to 2.00 x 10~ 7 ), consistent with previous estimates in 
M. tuberculosis (Ford et al. 2011; Walker et al. 2013). 

We used the MCMC approach to infer transmission net- 
works under two scenarios: one based only on the phyloge- 
netic tree (fig. 4C) and one in which posterior probability 
weights were modified using epidemiological data (fig. 4D), 
including infectivity, infectious period, and geographic loca- 
tion (supplementary fig. S2, Supplementary Material online, 
shows the posterior mode in both cases). Although both 
graphs placed the suspected source K02 as central to the 
network, the modified network displayed both a lower clus- 
tering coefficient (0.483 vs. 0.596) and a higher average per- 
edge posterior probability (0.302 vs. 0.253), meaning that 
the modified network contained fewer bidirectional events, 
displayed a more web-like layout suggestive of waves of 
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Fig. 4. Application to a real-world tuberculosis outbreak. (A) Phylogenetic tree inferred by BEAST. (B) SNVs differentiating the isolates. 
(C) Transmission network inferred without epidemiological modification. (D) Transmission network inferred with epidemiological modification. 
In (C) and (D), edges are shown with width and shading proportional to their posterior probability, except edges with low probability that are omitted. 
In (A), the phylogenetic tree is colored according to the consensus transmission tree from (D) and using the same unique colors for each host as in 
(C) and (D). 
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transmission rather than chains, and returned more high- 
probability events. 

Comparing inferred transmission events to known epide- 
miology revealed that both the basic and modified networks 
did capture aspects of known epidemiology, but in different 
ways. Among eight contacts sharing a sleeping space with the 
source case, three were identified by both reconstructions, 
two were present only in the basic reconstruction, and one 
was present only in the modified reconstruction. Another 
reported household contact between two cases was not re- 
turned in the basic reconstruction but was present in the 
modified version. Both inferred networks recapitulated a clus- 
ter of three cases linked to a specific sleeping location; in the 
basic model, the cluster appears as a complete subgraph with 
roughly equal probabilities for all six edges, while in the mod- 
ified version, it is clear that a single individual infected two 
other cases. We also examined the 15 transmission events 
with the highest probabilities in the modified reconstruction 
and found that three were well supported by epidemiological 
information, seven were possible given the known locations 
and contacts of cases, two did not appear to have an epide- 
miological linkage between the predicted infector/infected, 
and three were highly unlikely. 

The three unlikely transmissions involved the three cases 
known to reside in a different geographic location than the 
other outbreak cases and thought to have been infected by a 
visiting traveller from the outbreak community. In the basic 
reconstruction, this scenario is mostly recapitulated — the 
traveller infects two further cases. In the modified version, 
despite including an adjustment for geographic location, 
the traveller does not infect any of the other three cases 
nor are they infected by individuals with a travel history to 
the other community. 

The parameters estimated without and with epide- 
miological modification were similar, with 
(Neg&Y) = (1-48,0.0037,1.29) and (1.21, 0.0032, 1.23) re- 
spectively. This value of the removal rate y indicates that, on 
average, 1 5 months elapsed between infection of a host and 
its detection. We interpret the transmission rate /3 in terms of 
the basic reproduction number R 0 = /3N/y, the expected 
number of secondary infections caused by a case in a suscep- 
tible population. Because N = 400, we calculate R 0 =1.15 
(95%CI 0.73-1.9) in the basic reconstruction and R 0 =1.03 
(95%CI 0.65-1.69) in the modified reconstruction. This rela- 
tively low number is expected for tuberculosis in a low- 
burden, highly resourced setting like the one studied here 
(Dye and Williams 2000). 

For simplicity, the networks described above were based 
on a single phylogeny, namely, the maximum credibility tree 
returned by BEAST. However, there was significant uncer- 
tainty in the BEAST posterior, as illustrated by a DensiTree 
plot (Bouckaert 2010) (supplementary fig. S3, Supplementary 
Material online). We therefore inferred transmission trees 
separately for each of the 100 trees sampled by BEAST, and 
the aggregated results are shown in supplementary figures S4 
and S5, Supplementary Material online, for the application 
without and with epidemiological modification, respectively. 
In the case of the latter, accounting for the uncertainty in the 



phylogenetic tree does not result in a great increase in uncer- 
tainty for the reconstructed transmission tree. 

Discussion 

We present a Bayesian inference method for reconstructing 
transmission events in a densely sampled outbreak using 
time-labeled genomic data. By modeling within-host evolu- 
tion as a neutral coalescent process, we are able to account for 
within-host genetic diversity and accurately ascribe infection 
events to a host harboring multiple lineages of a pathogen — 
an issue critical to the reconstruction of diseases with latent 
or asymptomatic carriage periods and/or chronic infection. 
We first reconstruct a phylogeny, leveraging the strengths of 
existing packages for phylogenetic inference such as BEAST 
(Drummond and Rambaut 2007), and next infer a transmis- 
sion tree using an efficient MCMC procedure. 

Recently, an approach was proposed to relate phylogenetic 
trees to transmission trees (Ypma et al. 2013), using a similar 
decomposition to ours; they simultaneously inferred a phy- 
logeny and a transmission tree for a data set on 12 farms 
infected with foot-and-mouth disease (Morelli et al. 2012). 
Their focus was on sampling and joint inference given a 
specific within-host population model and a sharply defined 
incubation period. In contrast, our method is very rapid and 
can be applied to large (densely sampled) outbreaks, it is 
applicable to infections with long and variable infectious pe- 
riods, it exploits the capabilities of BEAST, and infers rather 
than specifies the in-host effective population size. We are 
also able to flexibly incorporate additional field epidemiolog- 
ical data of various types. 

Our results demonstrate that even if the genomic data was 
perfectly informative about the phylogeny, as in our simula- 
tions, considerable uncertainty is present in the transmission 
network when a realistic model of within-host evolution is 
taken into account. Although true events are captured, they 
are often eclipsed by other potential events and cannot be 
completely identified, even with methods that extract the 
subgraph with maximal posterior probability weights. This 
contrasts with previous approaches to outbreak reconstruc- 
tion that equated phylogenetic branching with transmission 
and may be too assertive when within-host diversity exists. 
Our Bayesian approach captures this uncertainty. In practice, 
another source of uncertainty comes from the fact that the 
phylogeny is never known exactly but only informed to some 
extent by the presence of genetic variations between the 
sequenced genomes. This uncertainty in the phylogenetic 
inference is fully captured by BEAST, which returns a 
sample of trees from the posterior distribution given the ge- 
nomic data. Accounting for this phylogenetic uncertainty 
into the inference of the transmission tree can then be 
achieved by applying our method to each tree in the 
BEAST posterior sample (Nylander et al. 2008; Parker et al. 
2008). 

When applied to a real-world data set, the method based 
only on genomic data correctly inferred correctly inferred 
the most likely source case and several key transmission clus- 
ters supported by field data. We found however that the 
inference was significantly improved when it was based on 
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both genomic and epidemiological field data, something that 
our Bayesian method is well suited to accommodate. Any 
epidemiological information, or indeed a different model, 
can indeed be integrated into the prior for the transmission 
tree without affecting the remainder of the method. This 
illustrates the important point that genome sequencing in 
a public health context does not make epidemiological data 
redundant, but instead should be seen as a complementary 
stream of information. Even in the future when sequencing 
will become so cheap that multiple genomes can be se- 
quenced from all hosts, we believe that pure genomic epide- 
miology will not replace classical epidemiology. It is also 
important to note that, in a retrospective study, overall pat- 
terns of transmission are of more interest to public health 
than precise inference of who infected whom. Our method 
suggests that genomic information can shed light on whether 
outbreaks tend to occur as long chains of transmission or as 
large bursts from single hosts, and whether transmissions 
tend to be associated with specific environments or hosts, 
even if individual transmission events remain uncertain. 

Like any model-based statistical analysis, our method 
makes a number of assumptions, some of which might not 
be appropriate for every application. The within-host evolu- 
tionary dynamic is modeled by a simple coalescent process 
with constant population size (Kingman 1982). This could 
easily be relaxed to allow for variation in the within-host 
population size following infection while staying in a coales- 
cent framework (Griffiths and Tavare 1994). However, our 
specification represents a natural choice because it is the 
simplest possible choice of model with only a single param- 
eter Ngg. Very little is known about the within-host evolution- 
ary dynamics of most infectious diseases, but the use of a 
more complex model could be justified, for example, if it 
led to a better marginal likelihood as measured by a Bayes 
factor (Kass and Raftery 1995). Our model also assumes that 
the transmission bottleneck is complete. This would be more 
difficult to relax as any other choice would imply that some 
genetic diversity rather than a single variant can be transmit- 
ted from infector to infected, and therefore that the common 
ancestor of two isolates from the same host might be in a 
different host. Again, very little is known about this property 
for most infectious diseases, but among the few microorgan- 
isms where this was formally investigated, data suggest the 
bottleneck is very strong if not complete, for example, in HIV 
(Edwards et al. 2006; Keele et al. 2008; Haaland et al. 2009; 
Boeras et al. 2011). 

The main limitation of the method as presented here is 
that it assumes that all cases comprising an outbreak have 
been sampled. This assumption was acceptable for the tuber- 
culosis application, given the surveillance and reporting 
systems in place in this setting, but clearly would not be 
appropriate in all applications — for example, diseases with 
milder symptoms where cases may go unrecognized. 
However, our two-step approach provides a good starting 
point to investigate these outbreaks too. The phylogeny in- 
ferred in the first step does not assume full sampling and the 
coloring in the second step could be redefined so that a color 
corresponds not just to a single host but rather to a host plus 



all intermediate cases up to the next sampled case. Or alter- 
natively, the number of colors could be allowed to vary and be 
greater than the number of sampled individuals, which would 
require the use of a reversible-jump MCMC (Green 1995) to 
deal with the resulting transdimensional parameter space. 
Additional parameters, such as the proportion of unsampled 
infected hosts, would be required to modify the model in this 
way, but the general approach of coloring the phylogeny will 
remain valid. 

Materials and Methods 

Model for the Epidemiological Process between Hosts 
For the epidemic between-host spreading process, we con- 
sider a stochastic, continuous time Markov chain (CTMC) 
version of the general SIR epidemic model (Allen 2008). In a 
population of known size N, the parameters of this process 
are the infection rate /3 and removal rate y. If the state of the 
process at time t is (S t ,/ t ,R t ), denoting the number of suscep- 
tible, infected, and removed individuals, respectively, then 
transition to the state (S t — 1,/ t + 1,R t ) happens at rate 
ps t l t and transition to the state (S t ,/ t — 1,R t + 1) happens 
at rate yl t . The former event is a transmission and the latter 
a removal. The process is started in the state 
(S 0 = N — 1,/ 0 = 1,R 0 = 0) and is run until the epidemic 
finishes at l t = 0. 

This basic process has been extensively studied, especially 
in a Bayesian framework where inference can be performed 
using an MCMC (O'Neill and Roberts 1999; O'Neill 2002). Let 
n denote the number of individuals who have been infected, 
that is, the number of removed individuals at the end of the 
process. Let T denote the transmission tree, where each node 
is one of the n infected individuals and edges represent trans- 
mission events. In our notation, T contains the information of 
who infected whom and when, as well as when individuals 
were removed. Simulation of a transmission tree T given 
the parameters N, /3, and y can be done using the Gillespie 
algorithm (Gillespie 1976). 

Model for the Genealogical Process within Hosts 
We consider that when each individual is removed, a single 
genome from the infectious agent is isolated and genotyped. 
The genealogical relationships between these genomes can be 
represented as a timed genealogy C, where leaves correspond 
to the n genomes (one from each host) and internal nodes 
represent common ancestors of the genomes. The genealogy 
G not only depends on the transmission tree T but also de- 
pends on the within-host evolutionary dynamics (Alizon et al. 
2011) and the transmission bottleneck (Bergstrom et al. 1999; 
Grenfell et al. 2004). For simplicity, we model within-host 
evolution as a neutral coalescent process (Kingman 1982) 
with constant population size N e and average generation 
time g, which corresponds to the duration of a pathogen 
replication cycle. In this model, any two lineages within a 
host coalesce at constant rate Ng, which is therefore the 
sole within-host parameter of interest. Furthermore, the 
transmission bottleneck is assumed to be complete so that 
only a single genomic variant is transmitted from infector 
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to infected. This transmitted genome is a random sample 
from the infector's pathogen population at the time of 
transmission. 

Under these assumptions, the genealogy C can be simu- 
lated given a transmission tree T and parameter Ngg as fol- 
lows. In a first step, n subtrees are generated corresponding to 
the genealogical process within each host, and in a second 
step, these subtrees are pasted together in order to produce 
C. For the first step, the within-host genealogy of each host i is 
simulated independently of each other. The genealogy within 
host i has a single root at the time when /' was infected, a leaf 
at the time when / was removed, and a leaf for each host that / 
infected at the time when these transmission events hap- 
pened. Each within-host genealogy is generated using the 
coalescent with temporally offset leaves (Drummond et al. 
2002) with coalescence rate Ngg and using rejection sampling 
on the coalescent tree (Tavare et al. 1997) to ensure that a 
single lineage exists at the time when / became infected. For 
the second step, the pasting is done for each transmission 
event. For example, if A infected B at time t in the transmis- 
sion tree T, then the subtree of A contains a leaf at time t and 
the subtree of B has its root at time t, and pasting is done 
between this leaf and this root. Repeating this pasting for 
all transmission events completes the generation of the 
genealogy C. 

Bayesian Decomposition 

Our approach assumes that all n infected individuals are 
known and that their removal times are also known. Based 
on the genomes sampled from each of the n individuals, the 
timed genealogy G is reconstructed, and for the time being, 
we assume that C is known exactly. Because the times of the 
leaves of C represent the removal times of the n individuals, 
our notation C includes these removal times but not the 
times of infection. Due to the within-host genealogical pro- 
cess, transmission events do not occur at branching points in 
the genealogy. The aim of the inference is to infer the epide- 
miological parameters /? and y, the within-host evolutionary 
parameter and the transmission tree T. The posterior 
distribution of interest is therefore 

P(/3,y,N e g,T | C) cx P(C | N e g,T)P(T | ^,y)P(^)P(y)P(N e g). 

0) 

The last three terms represent the prior distribution of the 
parameters /?, y, and Ng, which we take to be exponential^ ), 
and it remains to specify P(C | N e g,T) and P(T | fi,y). 

First, the term P(C | N e g,T) in equation (1) represents the 
probability of the observed genealogy given the transmission 
tree and within-host parameter Ngg. We relate a transmission 
tree T to the genealogy C by associating each point on C to a 
host. This can be visualized as a "coloring" of the branches in 
C: each host / is represented with a unique color q, and a 
branch segment is colored with c, if it corresponds to evolu- 
tion that happened within host /'. Transmission events there- 
fore correspond to point on C where the color changes and 
do not, in general, coincide with branching times in the ge- 
nealogy. The same host may harbor several lineages ancestral 



to our sample at the same time, which can be visualized on 
the tree. Figure 1 illustrates this visualization and our notation. 
For a given host i, we can consider the tree G, obtained by 
looking only at the branches of C that are colored with q. This 
tree G, corresponds to the evolutionary process within host /' 
and has a number of leaves n, equal to one plus the number of 
hosts infected by /'. 

To find P(C | N e g,T), we exploit the fact that the subtrees 
G, correspond to evolution within each host / = 1..n, and so 
are independent of each other. Because the distribution is 
conditional on T, the dates d,j of the j - 1..n ; leaves in G; are 
known (corresponding to all transmissions from i plus the 
removal of /'). The date r, of the root of G : is also known 
because it corresponds to the infection of host i. This leads 
to the decomposition: 

n 

P(G | N e g,T) = Y\ P(Cj | N e g,d u ,rd. (2) 

The term P(G, | N e g,djj,rj) is the probability under the coa- 
lescent model with rate Ngg of the timed genealogy G, given 
the dates of its leaves, dy, and conditional on having only one 
ancestor by the time r, of the infection of host /'. If this last 
condition was not present (i.e., if r, was a very long time ago), 
then this distribution would be that of the coalescent with 
temporally offset leaves as described by Drummond et al. 
(2002). The additional condition corresponds to the fact 
that the most recent common ancestor (MRCA) of the 
leaves has to be more recent than the date of infection 
because the transmission bottleneck is absolute. 

To calculate P(G, | N e g,d\j,r\), we consider the n : leaves in 
increasing order of age. For each leaf j, we consider adding it 
on the genealogy formed by the previous leaves 1 ..j — 1. The 
first (most recent) leaf corresponds to a linear genealogy with 
probability 1. The second leaf has to coalesce with the linear 
genealogy of the first leaf before the infection time r,. The 
third leaf has to coalesce with the genealogy formed by the 
first two leaves before the infection time r„ etc. This process is 
repeated for all leaves so that the genealogy formed at the end 
is the complete G,. When adding leaf j, let A ; denote the sum 
of branch lengths in the genealogy formed so far between the 
time of leaf j and the time where it coalesces with an ancestor 
of a previously considered leaf. Similarly, let By denote the sum 
of branch lengths between the time of leaf j and the timer,. Aj 
is exponentially distributed with parameter Ngg, but this 
distribution is truncated by A ; < Bj to ensure that coales- 
cence happens before time r,. We therefore deduce that 
P(G, | N e g,d r/) can be calculated as follows: 



P(G -| N e g,d /J=n .. ni ,r,) = Y\ 



exp(-y(N t g)) 
N e g(1 - exp(-Bj/(N e g))y 



(3) 



This now completely specifies P(G | N e g,T) from equation (1 ). 

The term P(T | /3,y) in equation (1) represents the prob- 
ability of the full epidemiological process given the epidemi- 
ological parameters. T includes the times of all infection and 
removal times. Let t, =1 .. 2n represent the list of all these event 
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times (i.e., n infections plus n removals), sorted chronologi- 
cally, and let e i=1 2n be equal to 0 for infections and 1 for 
removals. At any time t, the number of susceptible, infected, 
and removed individuals is given by 

(4) 

By considering the probability of each event in turn, we 
deduce that 

P(T|Ay) = nG8/ fi S t ,.(l-e,) 

i-2 (?) 

+ )// t ,.e;)exp(-(j6S t ,./ ti + y\ t .){t; - t,_i))// ti . 

The denominator in equation (5) comes from the fact that T 
contains not only the states (S t ,/ t ,R t ) but also knowledge of 
who was the infector when an infection occurred and who 
was removed when a removal occurred, for both of which 
there is a uniform choice among the / ti -infected individuals at 
time t,. 

We now have all the components of equation (1). 
Inference proceeds using an MCMC approach. 

MCMC Moves 

For the parameters of the epidemiological process fi and y, 
we use the Cibbs moves introduced by O'Neill and Roberts 
(1999). These moves rely on the fact that the distribution 
P(T | /6,y) in equation (5) has for conjugate prior a gamma 
distribution so that the posterior of these parameters is also a 
gamma distribution. 

For the parameter of the within-host evolutionary 
process Ngg, we use a Metropolis move such that the 
new value is equal to the old one plus a draw from 
Uniform([— e, e]). This move is accepted with probability 
equal to the ratio of equation (2) after and before the 
move, times the ratio of prior probability of Ngg after and 
before the move. 

To update the transmission tree T, a single infection event 
is first chosen at random. If the infection is to the source case, 
then its date is proposed to be modified by a draw from 
Uniform([— e, e]) (excluding values that would make the 
transmission to the source case become more recent than 
the common ancestor of the genealogy C). Otherwise, if the 
transmission is not to the first case, then it corresponds to a 
transmission event, that is, a point at which two colors meet 
on the genealogy. The proposed MCMC update consists of 
moving this transmission event uniformly at random to an- 
other point on the genealogy where it gives a valid coloring, 
that is, one where 1) there are only n - 1 color changes, 2) 
each leaf is colored in the color c, of the host it corresponds to, 
and 3) the color q does not exist in the tree after the leaf 
corresponding to host i. We note that this move may impact 
the meaning not only of the transmission event being moved 
but also of other transmission events on the tree so that the 



validity of the whole configuration has to be checked. This 
proposed move is symmetric, and it is accepted with proba- 
bility equal to the ratio of the products of equations (2) and 
(5) after and before the move. In the supplementary infor- 
mation, Supplementary Material online, we discuss the sym- 
metry in more detail and show that the resulting Markov 
chain is irreducible. 

The computational time taken to perform a single iteration 
of the MCMC scales linearly with the number n of infected 
individuals, but the number of iterations required to achieve 
good convergence and mixing properties of the MCMC may 
vary accordingly. Our implementation of this MCMC algo- 
rithm is freely available as a Matlab package at http://code. 
google.eom/p/transphylo/ (last accessed April 12, 2014). 

Application of the Method to a Real-World 
M. tuberculosis Outbreak Data Set 
Whole genomes of 33 M. tuberculosis were sequenced on an 
lllumina HiSeq and short reads were aligned against the 
M. tuberculosis CDC1551 reference sequence using Burrows- 
Wheeler Aligner (BWA; Li and Durbin 2009). SAMtools (Li 
et al. 2009) mpileup with default parameters identified 20 
high-confidence single nucleotide variant (SNV) positions 
that differed among the isolates, defined as positions called 
with a quality score of 222, genotype quality of 99, and no 
indication of strand basis or low depth of coverage. 
Recognizing that repetitive regions frequently contain erro- 
neous variant calls, we did not include a SNV if it was located 
within 50 bp of another SNV. Phylogenetic inference was per- 
formed with BEAST (Drummond and Rambaut 2007) using 
the concatenated SNV data labeled with sampling dates. A 
coalescent model with constant population size (Kingman 
1982) was used for the tree prior. More complex priors 
such as a coalescent with exponential growth prior 
(Griffiths and Tavare 1994) or a birth-death epidemiological 
prior (Stadler et al. 2012) did not make a significant difference 
to the resulting phylogeny. 

Inference under the basic scenario used equation (5) for 
the prior on the transmission tree with N - 400, reflective of 
the number of individuals in the outbreak community esti- 
mated to be at risk of tuberculosis exposure. The second 
scenario modified equation (5) to incorporate known aspects 
of the outbreak's epidemiology, including the geographic lo- 
cation of a host, host smear status (positive or negative), and 
whether the transmission event in a proposed transmission 
network happened at a time that a host was known to be 
noninfectious (e.g., at a time before a negative tuberculosis 
skin test). We modeled infection at a rate (/3 + / + + P~I~)S 
and recovery at a total rate y(l + + for which / + and l~ 
are the numbers of smear-positive and -negative cases. We 
penalized transmission trees in which cases transmit before 
they are thought to have become infectious based on their 
clinical history, trees in which cases receive infection before 
the time of their most recent negative tuberculosis skin test, 
and trees in which transmission occurs between cases not 
known to have been in the same location. 
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This results in modifications to equation (5), although the 
principle is the same 



1 r ki (t) 



R(t)e 



-R(t)(t,-tM) 



Pk : (t) R(t) 

=n^(oe- R(txt '- ti - i) 



i=2 
2n 



(6) 



where R(t) is the total rate at which events occur, so that 
R(t)e _R(t)(ti_t| - l) is the probability that time t; - t/_i elapsed 
between events; ^ is the probability of the event being type 

kj (for example, given that an event occurred after a time 
interval t, — t;_i, the probability that it was a recovery is 

y(l+ + i Y ) ( l + (fS Vi + \p-i- )s )> and the population to which the 
event applies with uniform probability is P kj . For recoveries 
P kj = / + + /~ and for infection events, it is either / + or l~ 
corresponding to whether a smear-positive or -negative case 
was the infector. The penalty S ki for the event is either 1 (no 
penalty), 0.05 if the event contradicts the timing of cases' 
clinical histories, or 0.005 for a transmission event that oc- 
curred when cases were in different locations with no known 
travel of either case. This reflects a (small) probability that 
data for cases' locations did not incorporate all travel. 

The resulting transmission networks for both scenarios 
were visualized using Cytoscape v.3.0.2 (Smoot et al. 2011), 
using a yFiles organic layout and visually encoding the poste- 
rior probability of an edge using edge coloring and width. 
Cytoscape's NetworkAnalyzer module was used to calculate 
network features, including the clustering coefficient 
(Barabasi and Oltvai 2004) C n of a given node n in a directed 
graph: 



C n = e„/(k n (k„ - 1)), 



(7) 



where k n is the number of neighbors of node n and e n is the 
number of links between these k n neighbors. 

Supplementary Material 

Supplementary figures S1-S5 are available at Molecular 
Biology and Evolution online (http://www.mbe.oxfordjour 
nals.org/). 
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